Introduction

This report presents a Principal Component Analysis (PCA) of employment data from Eurostat. The goal is to:

  1. Identify patterns and relationships between countries based on employment statistics for different citizen categories.
  2. Reduce dimensionality for easier visualization and interpretation.
  3. Determine the contributions of each variable (citizen category) to the principal components.
  4. Visualize countries’ similarities and differences in employment patterns.

Libraries and Data Loading

Loading the necessary libraries for data manipulation, analysis, and visualization.

# Load necessary libraries
library(tidyverse)     # For data manipulation and visualization
library(eurostat)      # For downloading Eurostat data
library(FactoMineR)    # For performing PCA
library(factoextra)    # For visualizing PCA results
library(plotly)        # For interactive plots

Data Preparation

I Downloaded the lfsa_egan dataset and filtered it for:
- Years: 2021 to 2023
- Age Group: 20-64 years (working-age population)
- Sex: Both sexes (T)
- Unit: Employment measured in thousands (THS_PER)

# Download the 'lfsa_egan' dataset from Eurostat
data <- get_eurostat("lfsa_egan")

# Filter for the years 2021-2023, age 20-64, both sexes, and unit in thousands
data_filtered <- data %>%
  filter(TIME_PERIOD >= as.Date("2021-01-01") & TIME_PERIOD <= as.Date("2023-12-31"),
         age == "Y20-64", 
         sex == "T",
         unit == "THS_PER") %>%
  drop_na(values)  # Remove rows with missing values

I Inspected the unique values in the citizen and geo columns to get a better understanding of the data.

unique(data$citizen)
## [1] "EU27_2020_FOR"  "FOR"            "NAT"            "NEU27_2020_FOR"
## [5] "NRP"            "STLS"           "TOTAL"
unique(data$geo)
##  [1] "AT"        "BE"        "CH"        "CY"        "CZ"        "DE"       
##  [7] "DK"        "EA20"      "EE"        "EL"        "ES"        "EU27_2020"
## [13] "FI"        "FR"        "HU"        "IE"        "IS"        "IT"       
## [19] "LU"        "ME"        "MT"        "NL"        "NO"        "PT"       
## [25] "RS"        "SE"        "SI"        "SK"        "UK"        "BG"       
## [31] "HR"        "LT"        "LV"        "MK"        "PL"        "RO"       
## [37] "BA"        "TR"

I also created a data frame that maps each country code and categories to its corresponding full name.

# Create a named vector mapping country codes and categories to their full names
country_names <- c(
  AT = "Austria", BE = "Belgium", CH = "Switzerland", CY = "Cyprus", CZ = "Czech Republic",
  DE = "Germany", DK = "Denmark", EA20 = "Euro Area 20", EE = "Estonia", EL = "Greece",
  ES = "Spain", EU27_2020 = "European Union 27", FI = "Finland", FR = "France", HU = "Hungary",
  IE = "Ireland", IS = "Iceland", IT = "Italy", LU = "Luxembourg", ME = "Montenegro",
  MT = "Malta", NL = "Netherlands", NO = "Norway", PT = "Portugal", RS = "Serbia",
  SE = "Sweden", SI = "Slovenia", SK = "Slovakia", UK = "United Kingdom", BG = "Bulgaria",
  HR = "Croatia", LT = "Lithuania", LV = "Latvia", MK = "North Macedonia", PL = "Poland",
  RO = "Romania", BA = "Bosnia and Herzegovina", TR = "Turkey",
  
  # Categories
  EU27_2020_FOR = "EU27 Foreign Born", FOR = "Foreign Born", NAT = "National", 
  NEU27_2020_FOR = "Non-EU27 Foreign Born", NRP = "Non-Resident Population", 
  STLS = "Short-Term Labour Supply", TOTAL = "Total"
)

Data Transformation

I decided to transform the dataset into a wide format with countries as rows and citizen categories as columns. Then I filled the missing values with 0.

pca_data <- data_filtered %>%
  pivot_wider(names_from = citizen, values_from = values, values_fill = 0)

I prepared the data for PCA by removing non-numeric columns.

pca_input <- pca_data %>% select(-geo, -age, -sex, -unit, -freq, -TIME_PERIOD)

PCA Execution

I standardized the data and perform PCA.

# Standardize the data
pca_data_scaled <- scale(pca_input)

# Perform PCA
pca_result <- PCA(pca_data_scaled, graph = FALSE)

Results and Visualizations

Scree Plot

The scree plot shows the percentage of variance explained by each principal component.

fviz_eig(pca_result)

pca_result$eig
##          eigenvalue percentage of variance cumulative percentage of variance
## comp 1 5.843290e+00           8.347556e+01                          83.47556
## comp 2 1.008833e+00           1.441191e+01                          97.88747
## comp 3 1.097596e-01           1.567994e+00                          99.45547
## comp 4 3.049431e-02           4.356330e-01                          99.89110
## comp 5 7.622619e-03           1.088946e-01                          99.99999
## comp 6 5.037309e-07           7.196156e-06                         100.00000
## comp 7 6.907112e-11           9.867303e-10                         100.00000

Interpretation:
- The first principal component (Dim.1) explains 83.48% of the variance.
- The second principal component (Dim.2) explains 14.41%.
- Together, Dim.1 and Dim.2 account for 97.89% of the total variance.


Variable Contributions

Contributions to the First Principal Component (Dim.1)

fviz_contrib(pca_result, choice = "var", axes = 1)

Interpretation:
The variables EU27_2020_FOR, FOR, NAT, NEU27_2020_FOR, and TOTAL contribute almost equally (~17%) to the first principal component. This component captures general employment trends across these categories.

Contributions to the Second Principal Component (Dim.2)

fviz_contrib(pca_result, choice = "var", axes = 2)

Interpretation:
The NRP (Non-Resident Population) category dominates the second component with 98.82% contribution, indicating that Dim.2 captures variations specific to this category.

Contributions to the First Two Components

fviz_contrib(pca_result, choice = "var", axes = 1:2)


Individual Plot (Countries)

Visualize the countries in the PCA space, colored by their country codes.

# Replace country codes with full names in pca_data$geo
pca_data$geo <- country_names[pca_data$geo]

# Convert geo to a factor
pca_data$geo <- as.factor(pca_data$geo)

# Create the PCA plot with country names
pca_plot <- fviz_pca_ind(
  pca_result,
  label = "none",
  habillage = pca_data$geo,   # This now contains the full country names
  addEllipses = TRUE
) +
  labs(
    title = "PCA of Employment Data by Country (2021-2023) - Interactive Plot",
    x = "Principal Component 1 (83.5% Variance Explained)",
    y = "Principal Component 2 (14.4% Variance Explained)",
  )

# Convert the plot to an interactive plot
interactive_plot <- ggplotly(pca_plot)

# Manually update the legend labels to remove ",1"
for (i in seq_along(interactive_plot$x$data)) {
  if (!is.null(interactive_plot$x$data[[i]]$name)) {
    interactive_plot$x$data[[i]]$name <- gsub(",1", "", interactive_plot$x$data[[i]]$name)
  }
}

# Display the interactive plot
interactive_plot

Countries are grouped based on their employment patterns. By looking at the plot I identified 4 macro-clusters:
1. Turkey on the top
2. Euro Area 20 and European Union 27 on the right
3. Germany in the middle
4. The remaining groups on the lower-left
By taking a closer look at the group on the lower-left (draw a rectangle on the plot to zoom) I saw that Italy, France and Spain were plotted closer to eachother and further away from the other EU-countries. This may show that these countries share a similar employment distributions.


Interpretation of Results

Key Findings

  • Dim.1 captures general employment trends across various citizen categories (EU27_2020_FOR, FOR, NAT, etc.).
  • Dim.2 highlights the Non-Resident Population (NRP) as a distinct group.
  • The first two components explain 97.89% of the total variance, making them highly informative.

Country Clusters

  • The PCA plot shows clusters of countries with similar employment distributions.
  • Differences between clusters can inform targeted employment policies or further investigations.

Conclusions

This PCA analysis helps to:


References

Eurostat Dataset: lfsa_egan